Introduction


FAUST : Functional AUdio STream 

A programming language for realtime signal processing 


Goals and Principles : 


Adequate Notation for Signal Processing 


• Functional approach : A purely functional programming
language for real-time signal processing

• Strong formal basis : A language with well-defined formal
semantics


Separation between Specification and Implementation 


• Efficient compiled code : The generated C++ code should
compete with hand-written code

• Easy deployment : Multiple native implementations from a
single Faust program

FAUST Workflow

The example of PD externals 
New Code Generation Schemes


FAUST Compiler Extension 

Up to FAUST 0.9.9.4 : scalar code only

From FAUST 0.9.9.5 : vector code

From FAUST 0.9.9.5 : parallel code

Scalar Compilation Scheme 


scalar code generator 


The Scalar Compilation Scheme generates a single 
sample-level computation loop. 


Simple Example 

two 1-pole filters in parallel connected to an adder 


filter(c) = *(1-c) : + ~ *(c);

process = filter(0.9), filter(0.9) : +;


[Block diagrams: process and filter(0.9)]

Scalar Code Generation 


virtual void compute (int count, float** input, float** output) {
    float* input0 = input[0];
    float* input1 = input[1];
    float* output0 = output[0];
    for (int i=0; i<count; i++) {
        fRec0[0] = (0.1f * input1[i]) + (0.9f * fRec0[1]);
        fRec1[0] = (0.1f * input0[i]) + (0.9f * fRec1[1]);
        output0[i] = (fRec1[0] + fRec0[0]);

        // post processing
        fRec1[1] = fRec1[0];
        fRec0[1] = fRec0[0];
    }
}


Vector Compilation Scheme 


vector code generator 
(loop separation) 


The Vector Compilation Scheme simplifies the 
autovectorization work of the C++ compiler by splitting the 
sample processing loop into several simpler loops. 


Simple Example 

Vector Code Generation 


// SECTION : 1
for (int i=0; i<count; i++) {
    fRec0[i] = (0.1f * input1[i]) + (0.9f * fRec0[i-1]);
}

for (int i=0; i<count; i++) {
    fRec1[i] = (0.1f * input0[i]) + (0.9f * fRec1[i-1]);
}

// SECTION : 2
for (int i=0; i<count; i++) {
    output0[i] = fRec1[i] + fRec0[i];
}
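The separated loops read fRec0[i-1] at i = 0, so each recursion buffer needs one sample of history in front of it. The following is a minimal sketch of that layout, not the code FAUST emits: it assumes a zero initial state, and compute_scalar and compute_vector are hypothetical free functions standing in for the generated compute() method. It suggests that loop separation leaves the output unchanged.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// Scalar scheme: a single sample loop with a two-slot state per filter.
std::vector<float> compute_scalar(const std::vector<float>& in0,
                                  const std::vector<float>& in1) {
    assert(in0.size() == in1.size());
    const int count = static_cast<int>(in0.size());
    float fRec0[2] = {0.0f, 0.0f};
    float fRec1[2] = {0.0f, 0.0f};
    std::vector<float> out(in0.size());
    for (int i = 0; i < count; i++) {
        fRec0[0] = (0.1f * in1[i]) + (0.9f * fRec0[1]);
        fRec1[0] = (0.1f * in0[i]) + (0.9f * fRec1[1]);
        out[i] = fRec1[0] + fRec0[0];
        // post processing: shift the recursion state
        fRec1[1] = fRec1[0];
        fRec0[1] = fRec0[0];
    }
    return out;
}

// Vector scheme: one simple loop per signal, over whole buffers.
std::vector<float> compute_vector(const std::vector<float>& in0,
                                  const std::vector<float>& in1) {
    assert(in0.size() == in1.size());
    const int count = static_cast<int>(in0.size());
    // Assumption: one extra leading slot holds the previous sample (0 here),
    // so that index -1 is a valid read.
    std::vector<float> buf0(in0.size() + 1, 0.0f);
    std::vector<float> buf1(in0.size() + 1, 0.0f);
    float* fRec0 = buf0.data() + 1;
    float* fRec1 = buf1.data() + 1;
    std::vector<float> out(in0.size());
    // SECTION 1: the two recursive filters, independent of each other
    for (int i = 0; i < count; i++) {
        fRec0[i] = (0.1f * in1[i]) + (0.9f * fRec0[i - 1]);
    }
    for (int i = 0; i < count; i++) {
        fRec1[i] = (0.1f * in0[i]) + (0.9f * fRec1[i - 1]);
    }
    // SECTION 2: the adder, which has no loop-carried dependency
    for (int i = 0; i < count; i++) {
        out[i] = fRec1[i] + fRec0[i];
    }
    return out;
}
```

Only the last loop is free of loop-carried dependencies; the two recursive loops stay sequential in i but no longer interleave with the adder.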


Parallel Compilation Scheme 


parallel code generator 
(OpenMP directives) 


The Parallel Compilation Scheme analyzes the dependencies
between these loops and adds OpenMP pragmas to indicate
those that can be computed in parallel.


OpenMP 

Principle 


OpenMP is based on a set of compiler directives, library 
routines, and environment variables that influence run-time 
behavior in a fork-join model. 

[Figure: fork-join model, with two "#pragma omp parallel" regions]


Simple Example 

Parallel Code Generation 


// SECTION : 1
#pragma omp sections
{
    #pragma omp section
    for (int i=0; i<count; i++) {
        fRec0[i] = (0.1f * input1[i]) + (0.9f * fRec0[i-1]);
    }

    #pragma omp section
    for (int i=0; i<count; i++) {
        fRec1[i] = (0.1f * input0[i]) + (0.9f * fRec1[i-1]);
    }
}

// SECTION : 2
#pragma omp for
for (int i=0; i<count; i++) {
    output0[i] = (fRec1[i] + fRec0[i]);
}
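The sections and for directives only share work among threads that already exist, so they need an enclosing parallel region. The sketch below wraps the example in one, as an assumption about the surrounding code that the slide does not show. compute_parallel is a hypothetical free function, and the history slot in front of each buffer is likewise assumed. Built without OpenMP support, the pragmas are ignored and the code runs serially with the same result.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

std::vector<float> compute_parallel(const std::vector<float>& in0,
                                    const std::vector<float>& in1) {
    assert(in0.size() == in1.size());
    const int count = static_cast<int>(in0.size());
    // Assumption: one leading slot per buffer holds the previous sample (0 here).
    std::vector<float> buf0(in0.size() + 1, 0.0f);
    std::vector<float> buf1(in0.size() + 1, 0.0f);
    std::vector<float> out(in0.size());
    float* fRec0 = buf0.data() + 1;
    float* fRec1 = buf1.data() + 1;
    const float* input0 = in0.data();
    const float* input1 = in1.data();
    float* output0 = out.data();

    #pragma omp parallel
    {
        // SECTION 1: two independent recursive loops, one thread each
        #pragma omp sections
        {
            #pragma omp section
            for (int i = 0; i < count; i++) {
                fRec0[i] = (0.1f * input1[i]) + (0.9f * fRec0[i - 1]);
            }

            #pragma omp section
            for (int i = 0; i < count; i++) {
                fRec1[i] = (0.1f * input0[i]) + (0.9f * fRec1[i - 1]);
            }
        }
        // The implicit barrier at the end of "sections" guarantees both
        // filters are complete before the adder starts.

        // SECTION 2: iterations are independent, so they are split across threads
        #pragma omp for
        for (int i = 0; i < count; i++) {
            output0[i] = fRec1[i] + fRec0[i];
        }
    }
    return out;
}
```

With gcc, the -fopenmp option mentioned in the methodology enables the directives.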
Comparing Code Generation Schemes


Methodology 

How do the scalar, vector and parallel code 
compare ? 


In order to compare the new vector and parallel code with the 
scalar code we have run 126 tests: 

• 7 FAUST examples
• 3 code generations
• 2 compilers (gcc and icc)
• 3 machines (2, 4 and 8 cores)

What to measure ? 


The tests are based on a modified Alsa/GTK architecture,
alsa-gtk-bench.cpp, that measures the duration of the
compute() method :
TSC     The duration is measured using the TSC (Time
        Stamp Counter) register.

median  A total of 128+2048 measures are made per run.
        The first 128 measures are considered a warm-up
        period and are skipped. The median value of the
        following 2048 measures is computed.

MB/s    This median value, expressed in processor
        cycles, is first converted into a duration, and then into a
        number of megabytes produced per second
        (MB/s), considering the audio buffer size (2048 in our
        tests) and the number of output channels.
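The conversion chain above can be sketched as follows. This is an illustration under stated assumptions, not the code of alsa-gtk-bench.cpp: the function names are hypothetical, samples are taken to be 4-byte floats, one megabyte is taken as 10^6 bytes, and the processor frequency is passed in by the caller.

```cpp
#include <algorithm>
#include <cassert>
#include <cstddef>
#include <vector>

// Median of the measures that follow the warm-up period.
double median_after_warmup(std::vector<double> cycles, std::size_t warmup) {
    assert(cycles.size() > warmup);
    cycles.erase(cycles.begin(),
                 cycles.begin() + static_cast<std::ptrdiff_t>(warmup));
    std::sort(cycles.begin(), cycles.end());
    const std::size_t n = cycles.size();
    return (n % 2 == 1) ? cycles[n / 2]
                        : 0.5 * (cycles[n / 2 - 1] + cycles[n / 2]);
}

// Cycles per compute() call -> duration -> megabytes produced per second.
// Assumptions: bytes_per_sample defaults to sizeof(float); 1 MB = 1e6 bytes.
double megabytes_per_second(double median_cycles, double cpu_hz,
                            std::size_t buffer_size, std::size_t out_channels,
                            std::size_t bytes_per_sample = sizeof(float)) {
    assert(median_cycles > 0.0 && cpu_hz > 0.0);
    const double seconds = median_cycles / cpu_hz;
    const double bytes = static_cast<double>(buffer_size) *
                         static_cast<double>(out_channels) *
                         static_cast<double>(bytes_per_sample);
    return bytes / seconds / 1.0e6;
}
```

For the setup described here the calls would be median_after_warmup(measures, 128) on a vector of 128+2048 cycle counts, then megabytes_per_second(median, frequency, 2048, channels). Using the median rather than the mean keeps occasional scheduling spikes from skewing the figure.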

Code generations 


The tests are compiled with Faust 0.9.9.5b2 in three different 
settings : 

scal    faust -a alsa-gtk-bench.cpp test.dsp
        -o test.cpp

vec     faust -a alsa-gtk-bench.cpp -vec -vs
        3968 test.dsp -o test.cpp

par     faust -a alsa-gtk-bench.cpp -omp -vs
        3968 test.dsp -o test.cpp

C++ Compiler used 


We have also used two different C++ compilers, GNU GCC and
Intel ICC :

gcc     version 4.3.2 with options : -O3 -march=native
        -mfpmath=sse -msse -msse2 -msse3
        -ffast-math -ftree-vectorize
        (-fopenmp is added for OpenMP).

icc     version 11.0.074 with options : -O3 -xHost
        -ftz -fno-alias -fp-model fast=2
        (-openmp is added for OpenMP).

Machines used 


All the tests were run on three different machines : 

vaio    a Sony Vaio SZ3VP laptop, with an Intel T7400 dual core
        processor at 2167 MHz, 2GB of RAM, running an Ubuntu
        7.10 distribution with a 2.6.22-15-generic kernel.

xps     a Dell XPS machine with an Intel Q9300 quad core
        processor at 2500 MHz, 4GB of RAM, running an Ubuntu
        8.10 distribution with a 2.6.27-12-generic kernel.

macpro  an Apple Macpro with two Intel Xeon X5365 quad core
        processors at 3000 MHz, 2GB of RAM, running an Ubuntu
        8.10 distribution with a 2.6.27-12-generic kernel.

Demo 

Using FAUST Graphic IDE 

[Screenshot: FAUST graphic IDE (window karplus.dsp) showing Faust source and the generated C++ code]
Results

freeverb.dsp code 


monoReverb(fb1, fb2, damp, spread)
    = _ <: comb(combtuningL1+spread, fb1, damp),
           comb(combtuningL2+spread, fb1, damp),
           comb(combtuningL3+spread, fb1, damp),
           comb(combtuningL4+spread, fb1, damp),
           comb(combtuningL5+spread, fb1, damp),
           comb(combtuningL6+spread, fb1, damp),
           comb(combtuningL7+spread, fb1, damp),
           comb(combtuningL8+spread, fb1, damp)
      :>
           allpass (allpasstuningL1+spread, fb2)
         : allpass (allpasstuningL2+spread, fb2)
         : allpass (allpasstuningL3+spread, fb2)
         : allpass (allpasstuningL4+spread, fb2)
      ;

freeverb.dsp results

[Figure: freeverb.dsp results by compilation option]

karplus32.dsp code 


process =
    vgroup("karplus32",
        vgroup("noise generator",
            noise * hslider("level", 0.5, 0, 1, 0.1)
        )
      : vgroup("excitator",
            *(button("play") : trigger(size))
        )
     <: vgroup("resonator x32",
            par(i,32, resonator(dur+i*detune, att)
                      * (polyphony > i)
            )
        )
     :> *(output),*(output)
    );

karplus32.dsp results

[Figure: karplus32.dsp results by compilation option]

mixer.dsp code 


import("music.lib");

smooth(c) = *(1-c) : +~*(c);

vol = *(vslider("fader", 0, -60, 4, 0.1)
        : db2linear : smooth(0.99));

mute = *(1 - checkbox("mute"));

vumeter(x) = attach(x, env(x) : vbargraph("",0,1))
    with { env = abs:min(0.99):max ~ -(1.0/SR); };

pan = _ <: *(sqrt(1-c)), *(sqrt(c))
    with { c=(nentry("pan",0,-8,8,1)-8)/-16 : smooth(0.99); };

voice(v) = vgroup("voice %v",
               mute : hgroup("", vol : vumeter) : pan);
stereo = hgroup("stereo out", vol, vol);

process = hgroup("mixer", par(i,8,voice(i)) :> stereo);

mixer.dsp results

[Figure: mixer.dsp results by compilation option]

fdelay8.dsp code 


import("filter.lib");

line(i) = vgroup("line %i", fdelay5(128,d) : *(g))
with {
    g = vslider("gain (dB)",-60,-60,4,0.1)
        : db2linear : smooth(0.995);
    d = nentry("delay (samp)", 10, 10, 128, 0.1)
        : smooth(0.995);
};

process = hgroup("", par(i, 8, line(i)));

fdelay8.dsp results

[Figure: fdelay8.dsp results by compilation option]

rms.dsp code 


// Square of a signal 
square(x) = x * x ; 

// Sliding sum of n consecutive samples 
integrate(n,x) = x - x@n : +~_ ; 

// Mean of n consecutive samples of a signal 
mean (n) = integrate (n) : / (n); 

// Root Mean Square of n consecutive samples 
RMS(n) = square : mean(n) : sqrt ; 

// Root Mean Square of 1000 consecutive samples 
process = RMS(1000) ; 
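For readers less used to the block-diagram notation, the sketch below gives one C++ reading of rms.dsp. It is a hand-written illustration, not FAUST output: sliding_rms is a hypothetical name, the delayed signal x@n is taken as zero before the start of the stream, and the clamp before the square root is an added guard against rounding drift that the Faust program does not contain.

```cpp
#include <algorithm>
#include <cassert>
#include <cmath>
#include <cstddef>
#include <vector>

// Root mean square over a sliding window of n samples.
std::vector<float> sliding_rms(const std::vector<float>& x, std::size_t n) {
    assert(n > 0);
    std::vector<float> out(x.size());
    float sum = 0.0f;  // state of the "+ ~ _" recursion in integrate(n)
    for (std::size_t t = 0; t < x.size(); t++) {
        const float sq = x[t] * x[t];                                 // square
        const float delayed = (t >= n) ? x[t - n] * x[t - n] : 0.0f;  // square, then @n
        sum = (sq - delayed) + sum;                                   // x - x@n : +~_
        const float mean = sum / static_cast<float>(n);               // / (n)
        out[t] = std::sqrt(std::max(mean, 0.0f));                     // sqrt (guarded)
    }
    return out;
}
```

The running sum costs one addition and one subtraction per sample regardless of n, which is why integrate is written as a difference feeding an accumulator rather than as a sum of n terms.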

rms.dsp results

[Figure: rms.dsp results by compilation option]

rms8.dsp code 


process = par(i,8,component("rms.dsp")) ; 



rms8.dsp results

[Figure: rms8.dsp results by compilation option]
Overall comparison 

[Figure: speedup between vector and scalar code (icc), per test]
Conclusion


Automatic parallelization is the way to go : 


Discovering the parallelism of a program is : 


• Difficult in imperative programming languages like
C/C++/Java/...

• Easy in purely functional programming languages


Efficient parallelism on SMP machines is difficult 


• The memory bandwidth is a strong limit and SMP doesn’t
scale very well

• Efficient cache-aware scheduling is a key factor


What’s next ? 


Improve the scheduling of the parallel tasks 


• OpenMP 3.0 tasks
• Intel TBB (Threading Building Blocks):
http://www.threadingbuildingblocks.org
• Cilk Arts Cilk++: http://www.cilk.com
• Develop a new scheduling algorithm (derived from
work-stealing schedulers)